Overview

Today we will focus on making progress on the data for your final projects. By today (ideally), or by Sunday at 5pm at the latest, you will have a tidy data set that contains all of the variables you need to proceed with your final project analysis. You will upload the compiled HTML file from this .Rmd; it will count for both your lab grade and your Assignment 3 grade.

Pick one person to be the designated coder/writer for this assignment. Do all your work on their computer.

Grades for this assignment are designed to help you gauge where your group stands:

  • 100: On track; nothing left to do

  • 90: Minor changes; some additional work to do

  • 80: Major changes or errors; a lot of work to do to stay on track

Your document should contain the following code and content:

Code to:

  • Set up your workspace

  • Load your data (or datasets) into R

  • Recode, clean, merge, and transform your data as needed.

  • Produce a table of summary statistics

  • Produce at least one (and possibly more) informative descriptive figures

Content that:

  • Provides an overview of your data

    • What is the source of your data set(s)

    • What is the unit of observation?

    • How many observations?

  • Provides a code book identifying your

    • Outcome variable(s)
    • Key predictor(s)
    • Additional covariates
  • Gives a brief substantive description of your descriptive statistics: what does a typical observation in your data set look like?

  • Lists next steps and/or outstanding questions or goals. For example:

    • Clarifying theoretical framework and expectations

    • Specifying linear models to test research question

    • Fitting and interpreting linear models

    • Gathering additional data

    • Producing particular figures (maps, faceted plots)

Below, I’ve integrated these tasks into what I think is a reasonable workflow, so you’ll alternate between code and content as you work through this assignment.

1 Set up workspace

Use code from previous labs and class. At a minimum, you’ll want to load the tidyverse packages, and perhaps something like haven for reading data saved in other formats (e.g. Stata or SPSS files).

# Set up workspace
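A minimal setup chunk might look like the following (this assumes the packages are already installed; haven is only needed if your data come from Stata, SPSS, or SAS):

```r
# Load the core packages for this assignment
library(tidyverse)  # loads dplyr, ggplot2, tidyr, readr, and friends
library(haven)      # read_dta(), read_sav(): only needed for Stata/SPSS data
```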

2 Load Data

This will vary for each group. I believe most of you are loading data directly from the web. If you’re loading data stored locally, you’ll need to set the working directory to the folder where your data and .Rmd file are saved so that R can find the data.

# Load data
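As a sketch, loading might look like the following. The URL and file names are placeholders, not your actual data sources; the last few lines are a self-contained illustration that writes a tiny csv to a temporary file and reads it back the same way you would read your own data.

```r
library(readr)   # part of the tidyverse
# library(haven) # uncomment if your data are in Stata/SPSS format

# From the web (placeholder URL):
# df <- read_csv("https://example.com/my_data.csv")

# From a local file (placeholder names); first point R at the folder
# that holds both your data and your .Rmd:
# setwd("~/Documents/final_project")
# df <- read_dta("my_survey.dta")

# Self-contained illustration with a temporary csv
tmp <- tempfile(fileext = ".csv")
write_lines(c("state,turnout", "Ohio,0.61", "Iowa,0.64"), tmp)
df <- read_csv(tmp, show_col_types = FALSE)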

3 Codebook

Once you’ve loaded your data, create a small codebook outlining the following:

  • Dependent variable:
    • Conceptual description
    • Question wording (If relevant): Copy from codebook/questionnaire
    • raw variable name: Copy from codebook
    • recoded variable name: Define yourself
    • Values/Range: (e.g. 18 - 99+ years, 0 = non-voter, 1 = voter)
  • Key Predictor(s):
    • Conceptual description
    • Question wording (If relevant): Copy from codebook/questionnaire
    • raw variable name: Copy from codebook
    • recoded variable name: Define yourself
    • Values/Range: (e.g. 18 - 99+ years, 0 = non-voter, 1 = voter)
  • Covariates:
    • Conceptual description
    • Question wording: Copy from codebook/questionnaire
    • raw variable name: Copy from codebook
    • recoded variable name: Define yourself
    • Values/Range: (e.g. 18 - 99+ years, 0 = non-voter, 1 = voter)

The Values/Range entry should describe the values each variable can take once you’ve recoded it. You may need to inspect the data using commands like table() and summary() to clarify this.
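For example, checking values might look like this (the toy data frame and variable names are hypothetical; substitute your own):

```r
# Toy data frame for illustration; replace df with your own data
df <- data.frame(
  age   = c(23, 45, 67, NA, 38),
  voted = c(0, 1, 1, 0, 1)
)

table(df$voted, useNA = "ifany")  # counts for each value of a categorical variable
summary(df$age)                   # min, quartiles, mean, max, and NA count
```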

4 Clean Data

  • Use the mutate() command in combination with

    • case_when() to recode categorical variables using logical conditions.

    • ifelse() to recode binary variables; also useful for recoding values that should be NA.

    • Remember to save the output of your recoding back into the data set

  • If you’re working with multiple datasets, you’ll need to merge them together using left_join()

    • Use the by = c("var1" = "var2") argument to merge dataset 1 with dataset 2 using var1 in dataset 1 and var2 in dataset 2.

    • Make sure that the values in the variables you merge by match up. If you’re merging together state level data, make sure that both datasets spell each state name exactly the same way (e.g. you don’t want one data set to have “D.C.” and another to have “District of Columbia”)

    • Save the output into a temporary data frame. Check its dimensions: the number of rows should equal the number of rows in your main (final) data set, and the columns will include the additional unique variables. Merge in additional datasets one at a time, each time creating a temporary data frame so you can check the results of the merge. When you’re satisfied, save this data frame into a new object that will be the data frame you use for your analysis.

# Recode
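The steps above can be sketched with toy data (all variable names here are hypothetical; use your own raw names from the codebook):

```r
library(dplyr)

# Toy individual-level data (hypothetical variable names)
main <- data.frame(
  state_abb = c("OH", "IA", "DC"),
  pid7      = c(1, 7, 4),
  vote_raw  = c(2, 1, 99)  # 1 = voted, 2 = did not vote, 99 = missing
)

# Toy state-level data to merge in, with a differently named key
state_df <- data.frame(
  abbrev  = c("OH", "IA", "DC"),
  turnout = c(0.61, 0.64, 0.62)
)

tmp <- main %>%
  mutate(
    # case_when() for a categorical recode
    party3 = case_when(
      pid7 <= 3 ~ "Democrat",
      pid7 == 4 ~ "Independent",
      pid7 >= 5 ~ "Republican"
    ),
    # ifelse() to send missing codes to NA, then create a binary variable
    vote_raw = ifelse(vote_raw == 99, NA, vote_raw),
    voted    = ifelse(vote_raw == 1, 1, 0)
  ) %>%
  left_join(state_df, by = c("state_abb" = "abbrev"))

dim(tmp)  # same number of rows as main, plus the merged-in column(s)
```

Note that the merge result is saved into a temporary object (tmp) first, so you can check its dimensions before committing to it as your analysis data frame.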

5 Describe your data

Once you’ve recoded your data:

  • Create a table of summary statistics for your outcome, key predictor(s), and covariates

  • Present at least one descriptive figure that illustrates the distribution of your outcome or key predictor, or shows an interesting relationship between variables.

  • Interpret your results

5.1 Summary statistics:

Producing a table of summary statistics requires a little foresight.

Essentially, you want to make a data frame where each row is a (numeric) variable and each column is a statistic (minimum, 25th percentile, median, mean, 75th percentile, maximum, number of missing values).

To do this, I would:

  • Create an object called the_vars which contains the names (in quotation marks) of the variables you want to summarize.

  • Select these variables from your data set using df %>% select(all_of(the_vars))

  • Use %>% pivot_longer(), specifying cols = all_of(the_vars), names_to = "Variable", and values_to = "value", to transform this wide dataset into a long dataset

  • Then use %>% group_by(Variable) %>% summarise() to calculate the statistics for each variable of interest (e.g. %>% summarise(Mean = mean(value, na.rm = T)))

  • Save the output to an object called something like sum_df

  • In a new chunk, use knitr::kable(sum_df) %>% kableExtra::kable_styling() to format your table. Set echo=F in the code chunk header

# Summarise data
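Putting these steps together, the pipeline might look like the following (the toy data frame and variable names are placeholders for your own):

```r
library(dplyr)
library(tidyr)

# Toy data; substitute your own data frame and variable names
df <- data.frame(
  age    = c(23, 45, 67, NA, 38),
  income = c(30, 52, 48, 61, NA)
)

the_vars <- c("age", "income")

sum_df <- df %>%
  select(all_of(the_vars)) %>%
  # reshape wide -> long so each variable becomes a group
  pivot_longer(cols = all_of(the_vars),
               names_to = "Variable", values_to = "value") %>%
  group_by(Variable) %>%
  summarise(
    Min     = min(value, na.rm = TRUE),
    Q25     = quantile(value, 0.25, na.rm = TRUE),
    Median  = median(value, na.rm = TRUE),
    Mean    = mean(value, na.rm = TRUE),
    Q75     = quantile(value, 0.75, na.rm = TRUE),
    Max     = max(value, na.rm = TRUE),
    Missing = sum(is.na(value))
  )

sum_df
# Then, in a separate chunk with echo=F:
# knitr::kable(sum_df) %>% kableExtra::kable_styling()
```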

5.2 Descriptive Figures

To create a figure, you’ll need to specify the following:

  • data (e.g. df %>%)

  • aesthetic mappings, ggplot(aes(x = predictor, y = outcome))

  • geometries

    • Univariate: geom_density(), geom_boxplot(), geom_histogram()

    • Bivariate: geom_point() (for a scatterplot), geom_line() for a trend.

Once you have a minimal working example, play around with other grammars of graphics:

  • labs() for custom labels

  • theme_XXX() (e.g. theme_minimal()) for custom themes

  • facet_wrap(~group) to produce the same plot faceted by some categorical grouping variable

When you’re happy with your figure, save it as an object in R (e.g. fig1 <- df %>% ggplot(aes(predictor, outcome)) + geom_point()). Then put that object in its own chunk to display it in your document.

Don’t let the perfect be the enemy of the good.

# Descriptive figures
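A minimal working example might look like this (the data frame and variable names are simulated placeholders; swap in your own predictor and outcome):

```r
library(dplyr)
library(ggplot2)

# Simulated toy data standing in for your own predictor and outcome
set.seed(42)
df <- data.frame(predictor = rnorm(100), outcome = rnorm(100))

fig1 <- df %>%
  ggplot(aes(x = predictor, y = outcome)) +
  geom_point() +                              # bivariate scatterplot
  labs(x = "Key predictor", y = "Outcome") +  # custom labels
  theme_minimal()                             # custom theme

fig1  # placing fig1 in its own chunk displays it in the knitted document
```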

5.3 Descriptive Interpretation:

Please provide an overview of the data (source, number of observations, unit of analysis).

Describe a typical observation, making reference to the statistics in your summary table.

Offer a substantive interpretation of your descriptive figure(s). What do they tell us about the distribution of a key variable, or the relationship between two variables?

6 Next Steps/Questions

Use this section to outline next steps for your group and to assign tasks and responsibilities. If you have any specific questions/requests/things I can help with, please let me know.